Skip to content

Address residual P1+P2s from re-audit of PR #412#422

Merged
igerber merged 2 commits into
mainfrom
fix-audit-412
May 13, 2026
Merged

Address residual P1+P2s from re-audit of PR #412#422
igerber merged 2 commits into
mainfrom
fix-audit-412

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 13, 2026

Summary

Audit follow-up to PR #412. The restored CI reviewer surfaced findings the degraded reviewer missed across all 5 prior rounds:

  • P1 (REGISTRY + code comment) - the claim that "R does not ship per-path `predict_het` on placebos either, so parity is preserved by deferral" contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)` dispatcher does: it forwards `predict_het` into each per-path `did_multiplegt_main` call alongside `placebo`, so R may emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite both surfaces as an explicit Python-side deferral, NOT a verified R-parity. TODO row added to track validating R's actual output and either implementing parity or documenting the deviation explicitly.

  • P2 (REGISTRY rtol claim) - the per-path heterogeneity R-parity paragraph claimed `rtol ~1e-6 on point estimates AND SE`, but parity tests use `BETA_RTOL=1e-6` and `SE_RTOL=1e-5`. Split the claim and note the WLS-denominator/cohort-recentering numerical drift motivating the looser SE bound.

  • P2 (replicate-weight df_survey refresh test gap) - the existing test `test_per_path_heterogeneity_replicate_weights_propagates_n_valid` would have passed if the new dedicated refresh loop failed to recompute `t_stat` / `p_value` / `conf_int` at the final `df_survey`. Strengthen to call `safe_inference(beta, se, df=df_survey)` on the first finite entry and assert the stored inference fields match.

  • P2 (paths_of_interest survey gap) - the documented composability of `paths_of_interest + heterogeneity + survey_design` was not regression-locked (all survey-specific tests used `by_path=k`). Add two new tests: analytical Binder TSL coverage with selector-ordering preservation, and the multiplier-bootstrap gate under `paths_of_interest`.

No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed in source - documentation accuracy plus expanded regression coverage only.

Test plan

  • CI - new tests run as part of the standard slow-test matrix.
  • Locally verified: `pytest tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity::test_paths_of_interest_heterogeneity_{survey_design_analytical,survey_n_bootstrap_gate} tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity::test_per_path_heterogeneity_replicate_weights_propagates_n_valid -m ''` all pass.

🤖 Generated with Claude Code

igerber and others added 2 commits May 12, 2026 20:14
The restored CI reviewer surfaced findings the degraded reviewer
missed across all 5 prior rounds on PR #412:

P1 (REGISTRY + code comment): the claim that "R does not ship per-path
predict_het on placebos either, so parity is preserved by deferral"
contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)`
dispatcher actually does - it forwards `predict_het` into each
per-path `did_multiplegt_main` call along with `placebo`, so R may
emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite
both surfaces (chaisemartin_dhaultfoeuille_results.py code comment and
REGISTRY.md DataFrame-integration paragraph) as an explicit Python-
side deferral rather than a verified R-parity. Add a TODO row to
track validating R's actual placebo predict_het output and either
implementing parity or documenting the deviation explicitly.

P2 (REGISTRY rtol claim): the per-path heterogeneity R-parity
paragraph claimed "rtol ~1e-6 on point estimates AND SE", but the
parity tests use BETA_RTOL=1e-6 and SE_RTOL=1e-5 (one decade looser
on SE). Split the claim into the two separate tolerances and note
the WLS-denominator/cohort-recentering numerical drift that motivates
the looser SE bound.

P2 (replicate-weight df_survey refresh): the existing test only
checked finite SE; it would have passed if the new dedicated
heterogeneity refresh loop failed to recompute t_stat / p_value /
conf_int at the final df_survey. Strengthen the test to call
`safe_inference(beta, se, df=df_survey)` on the first finite entry
and assert the stored inference fields match - this anti-regression
covers the dedicated post-call refresh added for path_heterogeneity_
effects.

P2 (paths_of_interest survey gap): the documented composability of
`paths_of_interest + heterogeneity + survey_design` was not regression-
locked - all existing survey-specific tests used `by_path=k`. Add
test_paths_of_interest_heterogeneity_survey_design_analytical (verify
analytical Binder TSL fits, selector ordering preserved, finite SE
per populated (path, l)) and test_paths_of_interest_heterogeneity_
survey_n_bootstrap_gate (verify the multiplier-bootstrap gate applies
under paths_of_interest too).

No estimator behavior, weighting, variance/SE, identification, or
default statistical surface changed in source - documentation
accuracy plus expanded regression coverage only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 7e9f1aa861fd043ea6b276b57c7ec4363d1e5bbe


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The diff does not change dCDH estimator logic, weighting, variance/SE computation, identification assumptions, or defaults; it changes documentation/comments and adds regression coverage around by_path heterogeneity.
  • Affected methodology surface: dCDH per-path heterogeneity testing (predict_het, Web Appendix Section 1.5 / Lemma 7) and the by-path placebo-row export path.
  • The registry/comment updates correctly stop claiming verified R parity for per-path placebo heterogeneity and instead document the current Python behavior as a tracked deferral. docs/methodology/REGISTRY.md:L643-L645, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1862-L1903, TODO.md:L63
  • The rewritten parity-tolerance language matches the actual parity test constants (BETA_RTOL=1e-6, SE_RTOL=1e-5). tests/test_chaisemartin_dhaultfoeuille_parity.py:L1404-L1469
  • The new tests target the right regressions: final df_survey refresh on path_heterogeneity_effects, paths_of_interest + heterogeneity + survey_design, and the survey + multiplier-bootstrap gate under paths_of_interest. tests/test_chaisemartin_dhaultfoeuille.py:L10577-L10703
  • I could not execute the tests in this environment because pytest is not installed.

Methodology

  • Severity: P3. Impact: The remaining per-path placebo heterogeneity gap is now accurately documented as a Python-side deferral rather than an asserted R-parity fact; this matches the live behavior, where placebo rows in to_dataframe(level="by_path") still emit NaN het_* fields. Concrete fix: None required for approval; future parity work is already tracked. docs/methodology/REGISTRY.md:L643-L645, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1862-L1903, TODO.md:L63
  • No P0/P1 methodology findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: The new TODO entry properly tracks the unresolved placebo-heterogeneity parity work, so it is mitigated under the project’s deferred-work policy rather than remaining implicit. Concrete fix: None required in this PR. TODO.md:L63

Security

  • No findings.

Documentation/Tests

  • No findings. The added tests are well-targeted to the touched risks, especially the dedicated final-df refresh loop in fit(). diff_diff/chaisemartin_dhaultfoeuille.py:L3957-L4070, tests/test_chaisemartin_dhaultfoeuille.py:L10577-L10703
  • Verification note: test execution was not possible here because neither pytest nor python -m pytest is available in the review environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 13, 2026
@igerber igerber merged commit 9a2435a into main May 13, 2026
31 of 32 checks passed
@igerber igerber deleted the fix-audit-412 branch May 13, 2026 10:25
igerber added a commit that referenced this pull request May 15, 2026
Two informational findings from the latest review:

1. Stale `to_dataframe(level="by_path")` docstring at
   `chaisemartin_dhaultfoeuille_results.py:1530-1558` still claimed
   placebo `het_*` columns are NaN. Updated to document the post-#422
   contract: positive-horizon AND negative-horizon (placebo) rows are
   both populated when `placebo=True + heterogeneity=` are co-set;
   placebo rows under `survey_design` remain NaN with a fit-time
   UserWarning.

2. JSON golden-fixture type instability for empty `placebo_predict_het`
   / `placebo_horizons` slots. R's `jsonlite::toJSON` serializes
   plain `list()` as `[]` (array) but populated named lists as `{}`
   (object), so consumers iterating `.items()` on the slot saw
   different shapes across scenarios. Fixed both ends:
   - R-side: extractors initialize empty slots with
     `structure(list(), names = character(0))` which jsonlite
     serializes as `{}` even when empty. Verified across 4 scenarios
     (20, 21, 22, 23) — all `placebo_predict_het` / `placebo_horizons`
     slots now serialize as objects regardless of population.
   - Python-side: added `_as_dict` helper in the parity test module
     as a defensive backstop coercing any non-dict (None / [] /
     missing) to {}. Used at the two call sites that read optional
     placebo slots so consumers can call `.items()` uniformly.

The golden JSON regenerated; type-stable across all scenarios (verified
via `jq` on each scenario's predict_het type).

314 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request May 15, 2026
…tract

Two stale in-code documentation surfaces from the latest review:

1. `chaisemartin_dhaultfoeuille.py:980-989` — the `heterogeneity`
   parameter docstring on `fit()` still said "post-treatment regressions
   only (no placebo regressions)". Updated to document the post-#422
   contract: per-horizon OLS regressions on forward AND backward
   (placebo) horizons when `placebo=True`; survey_design composes with
   forward horizons but warns + skips backward horizons until the
   pre-period cell allocator is derived.

2. `chaisemartin_dhaultfoeuille_results.py:1279-1286` — the heterogeneity
   summary note didn't mention the survey forward-only fallback.
   Extended the note to cover the gating semantics so users reading
   `result.summary()` under `survey_design + heterogeneity` know what
   they're getting.

Both surfaces now match the contract already documented in the API rst,
REGISTRY, and CHANGELOG. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDL77 pushed a commit to TDL77/diff-diff that referenced this pull request May 18, 2026
A holistic codex review of the merged igerber#412 + cleanup igerber#422 state
surfaced three documentation/test gaps that the per-PR cleanup review
path could not see (it only scopes to the cleanup diff). All three
are at-most P3 in severity but each is real claim-vs-coverage drift.

1. REGISTRY's top-level `Note (Phase 3 by_path ...)` `to_dataframe(
   level="by_path")` schema list omits the `het_*` columns
   (`het_beta`, `het_se`, `het_t_stat`, `het_p_value`,
   `het_conf_int_lower`, `het_conf_int_upper`) that `_to_dataframe`
   has always emitted since the Phase 5 heterogeneity wave landed.
   Add them to the schema list so the registry contract matches the
   implementation.

2. The two new parity tests
   (`TestDCDHDynRParityHeterogeneity`,
    `TestDCDHDynRParityByPathHeterogeneity`) assert only `beta` and
   `se` from the R golden payload, leaving `t_stat`, `p_value`,
   `conf_int`, and `n_obs` unpinned. A regression in the inference
   extraction or final-`df_survey` refresh could ship while parity
   still passes. Pin `t_stat` at `SE_RTOL` (invariant to critical-
   value distribution since `t = beta / se`) and `n_obs` exactly.

3. While extending the parity assertions, surfaced a real Python-vs-R
   structural deviation that was undocumented:
   `_compute_heterogeneity_test` passes `df=None` to `safe_inference`,
   so Python uses the normal Z critical value (~1.96) for `p_value`
   and `conf_int`. R `did_multiplegt_dyn(..., predict_het)` uses the
   t-distribution with df = n - k from the WLS regression. The
   structural gap produces ~0.1-2% rtol gaps on CIs and p-values
   that exceed `SE_RTOL` (verified empirically on the parity fixture:
   CI gap ~0.17% on h=1). Document the deviation in the heterogeneity
   R-parity Note. Pin only `beta`, `se`, `t_stat`, `n_obs` in the
   parity tests; `p_value` and `conf_int` parity intentionally
   skipped. Add a TODO row tracking the optional df-threading work.

No estimator behavior, weighting, variance/SE computation, or
default-statistical surface changed - documentation accuracy plus
expanded regression coverage only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDL77 pushed a commit to TDL77/diff-diff that referenced this pull request May 18, 2026
Closes TODO igerber#422 + pilot-412 in a single PR (same surface, same R fixture,
same parity story).

Phase 0 probe verified R behavior: did_multiplegt_dyn(by_path, predict_het,
placebo) emits per-path heterogeneity OLS results on backward (placebo)
horizons via R's per-by_level dispatcher (DIDmultiplegtDYN:::did_multiplegt_main
placebo block at the `effect = matrix(-i, ...)` rbind site). New scenario 22
in benchmarks/R/generate_dcdh_dynr_test_values.R captures this with
predict_het=list("het_x", c(-1)) — the c(-1) sentinel triggers "compute
heterogeneity for ALL forward (1..effects) AND ALL placebo (1..placebo)
positions" per the R source path read at script time.

Phase 1A implementation (non-survey): _compute_heterogeneity_test gains a
placebo: int = 0 parameter and iterates forward (1..L_max) and backward
(-1..-placebo) horizons in a single loop. Explicit `if out_idx < 0: continue`
eligibility guard prevents numpy negative-index silent wrap on N_mat[g, out_idx]
when F_g - 1 + l_h < 0. _compute_path_heterogeneity_test forwards the param;
fit() passes placebo=L_max if self.placebo else 0 to both global and per-path
call sites. to_dataframe(level="by_path") placebo rows now read het_* values
from path_heterogeneity_effects negative-int keys (mirroring the existing
path_placebo_event_study negative-key convention) instead of the pre-PR
hardcoded NaN-fill.

Survey gate: when survey_design is active AND placebo > 0 + heterogeneity is
requested, _compute_heterogeneity_test raises NotImplementedError eagerly with
a documented message. The Binder TSL cell-period allocator's REGISTRY
justification is tied to post-period attribution; backward-horizon attribution
puts ψ_g mass on a pre-period cell, which is a separate library-extension
claim that needs its own derivation. Forward-horizon predict_het + survey
continues to work unchanged. Pre-period allocator derivation tracked as a
new follow-up TODO row.

Phase 2 (df threading): _compute_heterogeneity_test now passes
df = n_obs - n_params to safe_inference on the non-survey OLS path, matching
R did_multiplegt_dyn(predict_het=...)'s t-distribution inference
(qt(0.975, df.residual(model)) site). Pre-PR Python used df=None (Z critical),
producing 0.1-2% rtol gaps on p_value/conf_int vs R. Existing forward-horizon
parity tests now pin t/p/CI at INFERENCE_RTOL=1e-4 (was unpinned). Rank-
deficient designs use design.shape[1] as df denominator (pre-drop column
count); fully rank-deficient is NaN-short-circuited by the existing guard.
Near-rank-deficient edge case tracked as a new Low TODO follow-up.

R parity: scenario 22 (multi_path_reversible_predict_het_with_placebo,
placebo=2, effects=3, by_path=3) pinned at BETA_RTOL=1e-6/SE_RTOL=1e-5 for
beta/se/t_stat/n_obs and INFERENCE_RTOL=1e-4 for p_value/conf_int across
3 paths × (3 forward + 2 placebo) = 15 horizons.

Cross-surface tests (TestByPathPredictHetPlacebo): placebo het column
population, survey-gate NotImplementedError, forward+survey anti-regression,
out_idx<0 eligibility guard, single-path telescope
(path_heterogeneity_effects[(only_path,)] == heterogeneity_effects bit-
exactly), summary rendering. The two existing local-invariant tests
(test_*_inference_matches_safe_inference) refactored to verify SE-derivation
wiring (t_stat=beta/se, conf_int symmetric around beta, p_value in [0,1])
without back-deriving n_params.

REGISTRY: heterogeneity Z-vs-t deviation note replaced with positive
"R parity (post-2026-05-15 df threading)" framing including the rank-
deficient caveat. New "Per-path placebo heterogeneity" Note documents the
R parity, syntax requirements (c(-1) sentinel), survey gate, and test
anchors. CHANGELOG entry under [Unreleased]. llms-full.txt by_path entry
extended with placebo het composition + survey-gate mention. API rst extended
with the same.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant